Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments
Abstract
This paper presents the performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment.
Similar resources
Improving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
The STAR Fault Manager for Distributed Operating Environments. Design, Implementation and Performance
This paper presents the design, implementation, and performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, ST A ...
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
PhD Research Proposal: Fault Tolerance and Quality of Service in Large-Scale Networked Virtual Environments
The Research Proposal is a part of the project: Middleware Services for Management of Shared State in Large-Scale Distributed Interactive Applications (MiSMoSS). MiSMoSS is funded by the Research Council of Norway and is Project No. 15992/431. The project is expected to lead to three PhD theses supervised by faculty members Carsten Griwodz, Paal Halvorsen and Ellen Munthe-Kaas in the Networks a...
CORBA Based Runtime Support for Load Distribution and Fault Tolerance
Parallel scientific computing in a distributed computing environment based on CORBA requires additional services not (yet) included in the CORBA specification: load distribution and fault tolerance. Both of them are essential for long running applications with high computational demands as in the case of computational engineering applications. The proposed approach for providing these services is...